<!DOCTYPE html>
<HTML lang = "en">
<HEAD>
  <meta charset="UTF-8"/>
  <meta name="viewport" content="width=device-width, initial-scale=1.0, user-scalable=yes">
  <title>From Optimization to Probabilistic Programming</title>
  

  <script type="text/x-mathjax-config">
    MathJax.Hub.Config({
      tex2jax: {inlineMath: [['$','$'], ['\\(','\\)']]},
      TeX: { equationNumbers: { autoNumber: "AMS" } }
    });
  </script>

  <script type="text/javascript" async src="https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.1/MathJax.js?config=TeX-AMS-MML_HTMLorMML">
  </script>

  
<style>
pre.hljl {
    border: 1px solid #ccc;
    margin: 5px;
    padding: 5px;
    overflow-x: auto;
    color: rgb(68,68,68); background-color: rgb(251,251,251); }
pre.hljl > span.hljl-t { }
pre.hljl > span.hljl-w { }
pre.hljl > span.hljl-e { }
pre.hljl > span.hljl-eB { }
pre.hljl > span.hljl-o { }
pre.hljl > span.hljl-k { color: rgb(148,91,176); font-weight: bold; }
pre.hljl > span.hljl-kc { color: rgb(59,151,46); font-style: italic; }
pre.hljl > span.hljl-kd { color: rgb(214,102,97); font-style: italic; }
pre.hljl > span.hljl-kn { color: rgb(148,91,176); font-weight: bold; }
pre.hljl > span.hljl-kp { color: rgb(148,91,176); font-weight: bold; }
pre.hljl > span.hljl-kr { color: rgb(148,91,176); font-weight: bold; }
pre.hljl > span.hljl-kt { color: rgb(148,91,176); font-weight: bold; }
pre.hljl > span.hljl-n { }
pre.hljl > span.hljl-na { }
pre.hljl > span.hljl-nb { }
pre.hljl > span.hljl-nbp { }
pre.hljl > span.hljl-nc { }
pre.hljl > span.hljl-ncB { }
pre.hljl > span.hljl-nd { color: rgb(214,102,97); }
pre.hljl > span.hljl-ne { }
pre.hljl > span.hljl-neB { }
pre.hljl > span.hljl-nf { color: rgb(66,102,213); }
pre.hljl > span.hljl-nfm { color: rgb(66,102,213); }
pre.hljl > span.hljl-np { }
pre.hljl > span.hljl-nl { }
pre.hljl > span.hljl-nn { }
pre.hljl > span.hljl-no { }
pre.hljl > span.hljl-nt { }
pre.hljl > span.hljl-nv { }
pre.hljl > span.hljl-nvc { }
pre.hljl > span.hljl-nvg { }
pre.hljl > span.hljl-nvi { }
pre.hljl > span.hljl-nvm { }
pre.hljl > span.hljl-l { }
pre.hljl > span.hljl-ld { color: rgb(148,91,176); font-style: italic; }
pre.hljl > span.hljl-s { color: rgb(201,61,57); }
pre.hljl > span.hljl-sa { color: rgb(201,61,57); }
pre.hljl > span.hljl-sb { color: rgb(201,61,57); }
pre.hljl > span.hljl-sc { color: rgb(201,61,57); }
pre.hljl > span.hljl-sd { color: rgb(201,61,57); }
pre.hljl > span.hljl-sdB { color: rgb(201,61,57); }
pre.hljl > span.hljl-sdC { color: rgb(201,61,57); }
pre.hljl > span.hljl-se { color: rgb(59,151,46); }
pre.hljl > span.hljl-sh { color: rgb(201,61,57); }
pre.hljl > span.hljl-si { }
pre.hljl > span.hljl-so { color: rgb(201,61,57); }
pre.hljl > span.hljl-sr { color: rgb(201,61,57); }
pre.hljl > span.hljl-ss { color: rgb(201,61,57); }
pre.hljl > span.hljl-ssB { color: rgb(201,61,57); }
pre.hljl > span.hljl-nB { color: rgb(59,151,46); }
pre.hljl > span.hljl-nbB { color: rgb(59,151,46); }
pre.hljl > span.hljl-nfB { color: rgb(59,151,46); }
pre.hljl > span.hljl-nh { color: rgb(59,151,46); }
pre.hljl > span.hljl-ni { color: rgb(59,151,46); }
pre.hljl > span.hljl-nil { color: rgb(59,151,46); }
pre.hljl > span.hljl-noB { color: rgb(59,151,46); }
pre.hljl > span.hljl-oB { color: rgb(102,102,102); font-weight: bold; }
pre.hljl > span.hljl-ow { color: rgb(102,102,102); font-weight: bold; }
pre.hljl > span.hljl-p { }
pre.hljl > span.hljl-c { color: rgb(153,153,119); font-style: italic; }
pre.hljl > span.hljl-ch { color: rgb(153,153,119); font-style: italic; }
pre.hljl > span.hljl-cm { color: rgb(153,153,119); font-style: italic; }
pre.hljl > span.hljl-cp { color: rgb(153,153,119); font-style: italic; }
pre.hljl > span.hljl-cpB { color: rgb(153,153,119); font-style: italic; }
pre.hljl > span.hljl-cs { color: rgb(153,153,119); font-style: italic; }
pre.hljl > span.hljl-csB { color: rgb(153,153,119); font-style: italic; }
pre.hljl > span.hljl-g { }
pre.hljl > span.hljl-gd { }
pre.hljl > span.hljl-ge { }
pre.hljl > span.hljl-geB { }
pre.hljl > span.hljl-gh { }
pre.hljl > span.hljl-gi { }
pre.hljl > span.hljl-go { }
pre.hljl > span.hljl-gp { }
pre.hljl > span.hljl-gs { }
pre.hljl > span.hljl-gsB { }
pre.hljl > span.hljl-gt { }
</style>



  <style type="text/css">
  @font-face {
  font-style: normal;
  font-weight: 300;
}
@font-face {
  font-style: normal;
  font-weight: 400;
}
@font-face {
  font-style: normal;
  font-weight: 600;
}
html {
  font-family: sans-serif; /* 1 */
  -ms-text-size-adjust: 100%; /* 2 */
  -webkit-text-size-adjust: 100%; /* 2 */
}
body {
  margin: 0;
}
article,
aside,
details,
figcaption,
figure,
footer,
header,
hgroup,
main,
menu,
nav,
section,
summary {
  display: block;
}
audio,
canvas,
progress,
video {
  display: inline-block; /* 1 */
  vertical-align: baseline; /* 2 */
}
audio:not([controls]) {
  display: none;
  height: 0;
}
[hidden],
template {
  display: none;
}
a:active,
a:hover {
  outline: 0;
}
abbr[title] {
  border-bottom: 1px dotted;
}
b,
strong {
  font-weight: bold;
}
dfn {
  font-style: italic;
}
h1 {
  font-size: 2em;
  margin: 0.67em 0;
}
mark {
  background: #ff0;
  color: #000;
}
small {
  font-size: 80%;
}
sub,
sup {
  font-size: 75%;
  line-height: 0;
  position: relative;
  vertical-align: baseline;
}
sup {
  top: -0.5em;
}
sub {
  bottom: -0.25em;
}
img {
  border: 0;
}
svg:not(:root) {
  overflow: hidden;
}
figure {
  margin: 1em 40px;
}
hr {
  -moz-box-sizing: content-box;
  box-sizing: content-box;
  height: 0;
}
pre {
  overflow: auto;
}
code,
kbd,
pre,
samp {
  font-family: monospace, monospace;
  font-size: 1em;
}
button,
input,
optgroup,
select,
textarea {
  color: inherit; /* 1 */
  font: inherit; /* 2 */
  margin: 0; /* 3 */
}
button {
  overflow: visible;
}
button,
select {
  text-transform: none;
}
button,
html input[type="button"], /* 1 */
input[type="reset"],
input[type="submit"] {
  -webkit-appearance: button; /* 2 */
  cursor: pointer; /* 3 */
}
button[disabled],
html input[disabled] {
  cursor: default;
}
button::-moz-focus-inner,
input::-moz-focus-inner {
  border: 0;
  padding: 0;
}
input {
  line-height: normal;
}
input[type="checkbox"],
input[type="radio"] {
  box-sizing: border-box; /* 1 */
  padding: 0; /* 2 */
}
input[type="number"]::-webkit-inner-spin-button,
input[type="number"]::-webkit-outer-spin-button {
  height: auto;
}
input[type="search"] {
  -webkit-appearance: textfield; /* 1 */
  -moz-box-sizing: content-box;
  -webkit-box-sizing: content-box; /* 2 */
  box-sizing: content-box;
}
input[type="search"]::-webkit-search-cancel-button,
input[type="search"]::-webkit-search-decoration {
  -webkit-appearance: none;
}
fieldset {
  border: 1px solid #c0c0c0;
  margin: 0 2px;
  padding: 0.35em 0.625em 0.75em;
}
legend {
  border: 0; /* 1 */
  padding: 0; /* 2 */
}
textarea {
  overflow: auto;
}
optgroup {
  font-weight: bold;
}
table {
  font-family: monospace, monospace;
  font-size : 0.8em;
  border-collapse: collapse;
  border-spacing: 0;
}
td,
th {
  padding: 0;
}
thead th {
    border-bottom: 1px solid black;
    background-color: white;
}
tr:nth-child(odd){
  background-color: rgb(248,248,248);
}


/*
* Skeleton V2.0.4
* Copyright 2014, Dave Gamache
* www.getskeleton.com
* Free to use under the MIT license.
* http://www.opensource.org/licenses/mit-license.php
* 12/29/2014
*/
.container {
  position: relative;
  width: 100%;
  max-width: 960px;
  margin: 0 auto;
  padding: 0 20px;
  box-sizing: border-box; }
.column,
.columns {
  width: 100%;
  float: left;
  box-sizing: border-box; }
@media (min-width: 400px) {
  .container {
    width: 85%;
    padding: 0; }
}
@media (min-width: 550px) {
  .container {
    width: 80%; }
  .column,
  .columns {
    margin-left: 4%; }
  .column:first-child,
  .columns:first-child {
    margin-left: 0; }

  .one.column,
  .one.columns                    { width: 4.66666666667%; }
  .two.columns                    { width: 13.3333333333%; }
  .three.columns                  { width: 22%;            }
  .four.columns                   { width: 30.6666666667%; }
  .five.columns                   { width: 39.3333333333%; }
  .six.columns                    { width: 48%;            }
  .seven.columns                  { width: 56.6666666667%; }
  .eight.columns                  { width: 65.3333333333%; }
  .nine.columns                   { width: 74.0%;          }
  .ten.columns                    { width: 82.6666666667%; }
  .eleven.columns                 { width: 91.3333333333%; }
  .twelve.columns                 { width: 100%; margin-left: 0; }

  .one-third.column               { width: 30.6666666667%; }
  .two-thirds.column              { width: 65.3333333333%; }

  .one-half.column                { width: 48%; }

  /* Offsets */
  .offset-by-one.column,
  .offset-by-one.columns          { margin-left: 8.66666666667%; }
  .offset-by-two.column,
  .offset-by-two.columns          { margin-left: 17.3333333333%; }
  .offset-by-three.column,
  .offset-by-three.columns        { margin-left: 26%;            }
  .offset-by-four.column,
  .offset-by-four.columns         { margin-left: 34.6666666667%; }
  .offset-by-five.column,
  .offset-by-five.columns         { margin-left: 43.3333333333%; }
  .offset-by-six.column,
  .offset-by-six.columns          { margin-left: 52%;            }
  .offset-by-seven.column,
  .offset-by-seven.columns        { margin-left: 60.6666666667%; }
  .offset-by-eight.column,
  .offset-by-eight.columns        { margin-left: 69.3333333333%; }
  .offset-by-nine.column,
  .offset-by-nine.columns         { margin-left: 78.0%;          }
  .offset-by-ten.column,
  .offset-by-ten.columns          { margin-left: 86.6666666667%; }
  .offset-by-eleven.column,
  .offset-by-eleven.columns       { margin-left: 95.3333333333%; }

  .offset-by-one-third.column,
  .offset-by-one-third.columns    { margin-left: 34.6666666667%; }
  .offset-by-two-thirds.column,
  .offset-by-two-thirds.columns   { margin-left: 69.3333333333%; }

  .offset-by-one-half.column,
  .offset-by-one-half.columns     { margin-left: 52%; }

}
html {
  font-size: 62.5%; }
body {
  font-size: 1.5em; /* currently ems cause chrome bug misinterpreting rems on body element */
  line-height: 1.6;
  font-weight: 400;
  font-family: "Raleway", "HelveticaNeue", "Helvetica Neue", Helvetica, Arial, sans-serif;
  color: #222; }
h1, h2, h3, h4, h5, h6 {
  margin-top: 0;
  margin-bottom: 2rem;
  font-weight: 300; }
h1 { font-size: 3.6rem; line-height: 1.2;  letter-spacing: -.1rem;}
h2 { font-size: 3.4rem; line-height: 1.25; letter-spacing: -.1rem; }
h3 { font-size: 3.2rem; line-height: 1.3;  letter-spacing: -.1rem; }
h4 { font-size: 2.8rem; line-height: 1.35; letter-spacing: -.08rem; }
h5 { font-size: 2.4rem; line-height: 1.5;  letter-spacing: -.05rem; }
h6 { font-size: 1.5rem; line-height: 1.6;  letter-spacing: 0; }

p {
  margin-top: 0; }
a {
  color: #1EAEDB; }
a:hover {
  color: #0FA0CE; }
.button,
button,
input[type="submit"],
input[type="reset"],
input[type="button"] {
  display: inline-block;
  height: 38px;
  padding: 0 30px;
  color: #555;
  text-align: center;
  font-size: 11px;
  font-weight: 600;
  line-height: 38px;
  letter-spacing: .1rem;
  text-transform: uppercase;
  text-decoration: none;
  white-space: nowrap;
  background-color: transparent;
  border-radius: 4px;
  border: 1px solid #bbb;
  cursor: pointer;
  box-sizing: border-box; }
.button:hover,
button:hover,
input[type="submit"]:hover,
input[type="reset"]:hover,
input[type="button"]:hover,
.button:focus,
button:focus,
input[type="submit"]:focus,
input[type="reset"]:focus,
input[type="button"]:focus {
  color: #333;
  border-color: #888;
  outline: 0; }
.button.button-primary,
button.button-primary,
input[type="submit"].button-primary,
input[type="reset"].button-primary,
input[type="button"].button-primary {
  color: #FFF;
  background-color: #33C3F0;
  border-color: #33C3F0; }
.button.button-primary:hover,
button.button-primary:hover,
input[type="submit"].button-primary:hover,
input[type="reset"].button-primary:hover,
input[type="button"].button-primary:hover,
.button.button-primary:focus,
button.button-primary:focus,
input[type="submit"].button-primary:focus,
input[type="reset"].button-primary:focus,
input[type="button"].button-primary:focus {
  color: #FFF;
  background-color: #1EAEDB;
  border-color: #1EAEDB; }
input[type="email"],
input[type="number"],
input[type="search"],
input[type="text"],
input[type="tel"],
input[type="url"],
input[type="password"],
textarea,
select {
  height: 38px;
  padding: 6px 10px; /* The 6px vertically centers text on FF, ignored by Webkit */
  background-color: #fff;
  border: 1px solid #D1D1D1;
  border-radius: 4px;
  box-shadow: none;
  box-sizing: border-box; }
/* Removes awkward default styles on some inputs for iOS */
input[type="email"],
input[type="number"],
input[type="search"],
input[type="text"],
input[type="tel"],
input[type="url"],
input[type="password"],
textarea {
  -webkit-appearance: none;
     -moz-appearance: none;
          appearance: none; }
textarea {
  min-height: 65px;
  padding-top: 6px;
  padding-bottom: 6px; }
input[type="email"]:focus,
input[type="number"]:focus,
input[type="search"]:focus,
input[type="text"]:focus,
input[type="tel"]:focus,
input[type="url"]:focus,
input[type="password"]:focus,
textarea:focus,
select:focus {
  border: 1px solid #33C3F0;
  outline: 0; }
label,
legend {
  display: block;
  margin-bottom: .5rem;
  font-weight: 600; }
fieldset {
  padding: 0;
  border-width: 0; }
input[type="checkbox"],
input[type="radio"] {
  display: inline; }
label > .label-body {
  display: inline-block;
  margin-left: .5rem;
  font-weight: normal; }
ul {
  list-style: circle; }
ol {
  list-style: decimal; }
ul ul,
ul ol,
ol ol,
ol ul {
  margin: 1.5rem 0 1.5rem 3rem;
  font-size: 90%; }
li > p {margin : 0;}
th,
td {
  padding: 12px 15px;
  text-align: left;
  border-bottom: 1px solid #E1E1E1; }
th:first-child,
td:first-child {
  padding-left: 0; }
th:last-child,
td:last-child {
  padding-right: 0; }
button,
.button {
  margin-bottom: 1rem; }
input,
textarea,
select,
fieldset {
  margin-bottom: 1.5rem; }
pre,
blockquote,
dl,
figure,
table,
p,
ul,
ol,
form {
  margin-bottom: 1.0rem; }
.u-full-width {
  width: 100%;
  box-sizing: border-box; }
.u-max-full-width {
  max-width: 100%;
  box-sizing: border-box; }
.u-pull-right {
  float: right; }
.u-pull-left {
  float: left; }
hr {
  margin-top: 3rem;
  margin-bottom: 3.5rem;
  border-width: 0;
  border-top: 1px solid #E1E1E1; }
.container:after,
.row:after,
.u-cf {
  content: "";
  display: table;
  clear: both; }

pre {
  display: block;
  padding: 9.5px;
  margin: 0 0 10px;
  font-size: 13px;
  line-height: 1.42857143;
  word-break: break-all;
  word-wrap: break-word;
  border: 1px solid #ccc;
  border-radius: 4px;
}

pre.hljl {
  margin: 0 0 10px;
  display: block;
  background: #f5f5f5;
  border-radius: 4px;
  padding : 5px;
}

pre.output {
  background: #ffffff;
}

pre.code {
  background: #ffffff;
}

pre.julia-error {
  color : red
}

code,
kbd,
pre,
samp {
  font-family: Menlo, Monaco, Consolas, "Courier New", monospace;
  font-size: 0.9em;
}


@media (min-width: 400px) {}
@media (min-width: 550px) {}
@media (min-width: 750px) {}
@media (min-width: 1000px) {}
@media (min-width: 1200px) {}

h1.title {margin-top : 20px}
img {max-width : 100%}
div.title {text-align: center;}

  </style>
</HEAD>

<BODY>
  <div class ="container">
    <div class = "row">
      <div class = "col-md-12 twelve columns">
        <div class="title">
          <h1 class="title">From Optimization to Probabilistic Programming</h1>
          <h5>Chris Rackauckas</h5>
          <h5>December 12th, 2020</h5>
        </div>

        <p>With a high degree of probability, all things are probabilistic. In all of the cases we have previously looked at &#40;differential equations, neural networks, neural differential equations, physics-informed neural networks, etc.&#41; we have incorporated data into our models using point estimates, i.e. getting &quot;exact fits&quot;. However, data has noise and uncertainty. We want to extend our previous modeling approaches to include probablistic estimates. This is known as <em>probabilistic programming</em>, or Bayesian estimation on general programming models. To approach this topic, we will first introduce the Bayesian way of thinking about variables as random variables, estimating probabilistic programs, and how efficient probabilistic programming frameworks incorporate differentiable programming.</p>
<h2>Bayesian Modeling in a Nutshell</h2>
<p>The idea of Bayesian modeling is to treat your variables as a random variable with respect to some distribution. As a starting point, think about the linear model</p>
<p class="math">\[
f(x) = ax
\]</p>
<p>The standard way to think of the linear model is that <span class="math">$a$</span> is a variables, and so you put a value <span class="math">$x$</span> in and compute <span class="math">$ax$</span>. However, in the Bayesian sense, the value of <span class="math">$a$</span> can be a <em>random variable</em>. A random variable <span class="math">$Z$</span> is a variable which has probability of taking certain values from a <em>probability distribution</em>. If we say that <span class="math">$Z \sim f(y)$</span>, then we are saying that the probability that <span class="math">$Z$</span> takes a value in the set <span class="math">$\Omega$</span> is:</p>
<p class="math">\[
\int_\Omega f(y)dy
\]</p>
<p>For example, if <span class="math">$Z$</span> is a scalar, then the probability that <span class="math">$Z \in [0,1]$</span> is:</p>
<p class="math">\[
\int_0^1 f(y)dy
\]</p>
<p>Discrete probability distributions can be handled by either using distribution quantities and measures in the integral, or by simply saying <span class="math">$f(y)$</span> is the probability that <span class="math">$Z = y$</span>.</p>
<p>Given this representation of variables, <span class="math">$ax$</span> where <span class="math">$a$</span> follows a probability distribution induces a probability distribution on <span class="math">$f(x)$</span>. To numerically aquire this distribution, one can use <em>Monte Carlo sampling</em>. This is simply the repeat process of:</p>
<ol>
<li><p>Sample variables</p>
</li>
<li><p>Compute output</p>
</li>
</ol>
<p>Doing this repeatedly then produces samples of <span class="math">$f(x)$</span> from which a numerical representation of the distribution can be had. From there, going to a multivariable linear model like <span class="math">$f(x) = Ax$</span> is the same idea. Going to <span class="math">$f(x)$</span> where <span class="math">$f$</span> is an arbitrary program is still the same idea: sample every variable in the program, compute the output, and repeat for many samples. <span class="math">$f$</span> can be a neural network where all of the parameters are probabilistic, or it can be an ODE solver with probabilistic parameters.</p>
<h2>Quick Example</h2>
<p>Let&#39;s do a quick example with the Lotka-Volterra equations. Recall that this is the ordinary differential equation defined by the following system:</p>


<pre class='hljl'>
<span class='hljl-k'>using</span><span class='hljl-t'> </span><span class='hljl-n'>OrdinaryDiffEq</span><span class='hljl-p'>,</span><span class='hljl-t'> </span><span class='hljl-n'>Plots</span><span class='hljl-t'>
</span><span class='hljl-n'>f1</span><span class='hljl-t'> </span><span class='hljl-oB'>=</span><span class='hljl-t'> </span><span class='hljl-k'>function</span><span class='hljl-t'> </span><span class='hljl-p'>(</span><span class='hljl-n'>du</span><span class='hljl-p'>,</span><span class='hljl-n'>u</span><span class='hljl-p'>,</span><span class='hljl-n'>p</span><span class='hljl-p'>,</span><span class='hljl-n'>t</span><span class='hljl-p'>)</span><span class='hljl-t'>
  </span><span class='hljl-n'>du</span><span class='hljl-p'>[</span><span class='hljl-ni'>1</span><span class='hljl-p'>]</span><span class='hljl-t'> </span><span class='hljl-oB'>=</span><span class='hljl-t'> </span><span class='hljl-n'>p</span><span class='hljl-p'>[</span><span class='hljl-ni'>1</span><span class='hljl-p'>]</span><span class='hljl-t'> </span><span class='hljl-oB'>*</span><span class='hljl-t'> </span><span class='hljl-n'>u</span><span class='hljl-p'>[</span><span class='hljl-ni'>1</span><span class='hljl-p'>]</span><span class='hljl-t'> </span><span class='hljl-oB'>-</span><span class='hljl-t'> </span><span class='hljl-n'>p</span><span class='hljl-p'>[</span><span class='hljl-ni'>2</span><span class='hljl-p'>]</span><span class='hljl-t'> </span><span class='hljl-oB'>*</span><span class='hljl-t'> </span><span class='hljl-n'>u</span><span class='hljl-p'>[</span><span class='hljl-ni'>1</span><span class='hljl-p'>]</span><span class='hljl-oB'>*</span><span class='hljl-n'>u</span><span class='hljl-p'>[</span><span class='hljl-ni'>2</span><span class='hljl-p'>]</span><span class='hljl-t'>
  </span><span class='hljl-n'>du</span><span class='hljl-p'>[</span><span class='hljl-ni'>2</span><span class='hljl-p'>]</span><span class='hljl-t'> </span><span class='hljl-oB'>=</span><span class='hljl-t'> </span><span class='hljl-oB'>-</span><span class='hljl-n'>p</span><span class='hljl-p'>[</span><span class='hljl-ni'>3</span><span class='hljl-p'>]</span><span class='hljl-t'> </span><span class='hljl-oB'>*</span><span class='hljl-t'> </span><span class='hljl-n'>u</span><span class='hljl-p'>[</span><span class='hljl-ni'>2</span><span class='hljl-p'>]</span><span class='hljl-t'> </span><span class='hljl-oB'>+</span><span class='hljl-t'> </span><span class='hljl-n'>p</span><span class='hljl-p'>[</span><span class='hljl-ni'>4</span><span class='hljl-p'>]</span><span class='hljl-t'> </span><span class='hljl-oB'>*</span><span class='hljl-t'> </span><span class='hljl-n'>u</span><span class='hljl-p'>[</span><span class='hljl-ni'>1</span><span class='hljl-p'>]</span><span class='hljl-oB'>*</span><span class='hljl-n'>u</span><span class='hljl-p'>[</span><span class='hljl-ni'>2</span><span class='hljl-p'>]</span><span class='hljl-t'>
</span><span class='hljl-k'>end</span><span class='hljl-t'>
</span><span class='hljl-n'>θ</span><span class='hljl-t'> </span><span class='hljl-oB'>=</span><span class='hljl-t'> </span><span class='hljl-p'>[</span><span class='hljl-nfB'>1.5</span><span class='hljl-p'>,</span><span class='hljl-nfB'>1.0</span><span class='hljl-p'>,</span><span class='hljl-nfB'>3.0</span><span class='hljl-p'>,</span><span class='hljl-nfB'>1.0</span><span class='hljl-p'>]</span><span class='hljl-t'>
</span><span class='hljl-n'>u0</span><span class='hljl-t'> </span><span class='hljl-oB'>=</span><span class='hljl-t'> </span><span class='hljl-p'>[</span><span class='hljl-nfB'>1.0</span><span class='hljl-p'>;</span><span class='hljl-nfB'>1.0</span><span class='hljl-p'>]</span><span class='hljl-t'>
</span><span class='hljl-n'>tspan</span><span class='hljl-t'> </span><span class='hljl-oB'>=</span><span class='hljl-t'> </span><span class='hljl-p'>(</span><span class='hljl-nfB'>0.0</span><span class='hljl-p'>,</span><span class='hljl-nfB'>10.0</span><span class='hljl-p'>)</span><span class='hljl-t'>
</span><span class='hljl-n'>prob1</span><span class='hljl-t'> </span><span class='hljl-oB'>=</span><span class='hljl-t'> </span><span class='hljl-nf'>ODEProblem</span><span class='hljl-p'>(</span><span class='hljl-n'>f1</span><span class='hljl-p'>,</span><span class='hljl-n'>u0</span><span class='hljl-p'>,</span><span class='hljl-n'>tspan</span><span class='hljl-p'>,</span><span class='hljl-n'>θ</span><span class='hljl-p'>)</span><span class='hljl-t'>
</span><span class='hljl-n'>sol</span><span class='hljl-t'> </span><span class='hljl-oB'>=</span><span class='hljl-t'> </span><span class='hljl-nf'>solve</span><span class='hljl-p'>(</span><span class='hljl-n'>prob1</span><span class='hljl-p'>,</span><span class='hljl-nf'>Tsit5</span><span class='hljl-p'>())</span><span class='hljl-t'>
</span><span class='hljl-nf'>plot</span><span class='hljl-p'>(</span><span class='hljl-n'>sol</span><span class='hljl-p'>)</span>
</pre>


<img src=""  />

<p>Now let&#39;s assume that the <code>θ</code>&#39;s are random. With Julia we can make variables into random variables by using Distributions.jl:</p>


<pre class='hljl'>
<span class='hljl-k'>using</span><span class='hljl-t'> </span><span class='hljl-n'>Distributions</span><span class='hljl-t'>
</span><span class='hljl-n'>θ</span><span class='hljl-t'> </span><span class='hljl-oB'>=</span><span class='hljl-t'> </span><span class='hljl-p'>[</span><span class='hljl-nf'>Uniform</span><span class='hljl-p'>(</span><span class='hljl-nfB'>0.5</span><span class='hljl-p'>,</span><span class='hljl-nfB'>1.5</span><span class='hljl-p'>),</span><span class='hljl-nf'>Beta</span><span class='hljl-p'>(</span><span class='hljl-ni'>5</span><span class='hljl-p'>,</span><span class='hljl-ni'>1</span><span class='hljl-p'>),</span><span class='hljl-nf'>Normal</span><span class='hljl-p'>(</span><span class='hljl-ni'>3</span><span class='hljl-p'>,</span><span class='hljl-nfB'>0.5</span><span class='hljl-p'>),</span><span class='hljl-nf'>Gamma</span><span class='hljl-p'>(</span><span class='hljl-ni'>5</span><span class='hljl-p'>,</span><span class='hljl-ni'>2</span><span class='hljl-p'>)]</span>
</pre>


<pre class="output">
4-element Array&#123;Distributions.Distribution&#123;Distributions.Univariate,Distrib
utions.Continuous&#125;,1&#125;:
 Distributions.Uniform&#123;Float64&#125;&#40;a&#61;0.5, b&#61;1.5&#41;
 Distributions.Beta&#123;Float64&#125;&#40;α&#61;5.0, β&#61;1.0&#41;
 Distributions.Normal&#123;Float64&#125;&#40;μ&#61;3.0, σ&#61;0.5&#41;
 Distributions.Gamma&#123;Float64&#125;&#40;α&#61;5.0, θ&#61;2.0&#41;
</pre>


<p>from which we can sample points and propogate through the solver:</p>


<pre class='hljl'>
<span class='hljl-n'>_θ</span><span class='hljl-t'> </span><span class='hljl-oB'>=</span><span class='hljl-t'> </span><span class='hljl-n'>rand</span><span class='hljl-oB'>.</span><span class='hljl-p'>(</span><span class='hljl-n'>θ</span><span class='hljl-p'>)</span><span class='hljl-t'>
</span><span class='hljl-n'>prob1</span><span class='hljl-t'> </span><span class='hljl-oB'>=</span><span class='hljl-t'> </span><span class='hljl-nf'>ODEProblem</span><span class='hljl-p'>(</span><span class='hljl-n'>f1</span><span class='hljl-p'>,</span><span class='hljl-n'>u0</span><span class='hljl-p'>,</span><span class='hljl-n'>tspan</span><span class='hljl-p'>,</span><span class='hljl-n'>_θ</span><span class='hljl-p'>)</span><span class='hljl-t'>
</span><span class='hljl-n'>sol</span><span class='hljl-t'> </span><span class='hljl-oB'>=</span><span class='hljl-t'> </span><span class='hljl-nf'>solve</span><span class='hljl-p'>(</span><span class='hljl-n'>prob1</span><span class='hljl-p'>,</span><span class='hljl-nf'>Tsit5</span><span class='hljl-p'>())</span><span class='hljl-t'>
</span><span class='hljl-nf'>plot</span><span class='hljl-p'>(</span><span class='hljl-n'>sol</span><span class='hljl-p'>)</span>
</pre>


<img src=""  />

<p>and from which we can get an ensemble of solutions:</p>


<pre class='hljl'>
<span class='hljl-n'>prob_func</span><span class='hljl-t'> </span><span class='hljl-oB'>=</span><span class='hljl-t'> </span><span class='hljl-k'>function</span><span class='hljl-t'> </span><span class='hljl-p'>(</span><span class='hljl-n'>prob</span><span class='hljl-p'>,</span><span class='hljl-n'>i</span><span class='hljl-p'>,</span><span class='hljl-n'>repeat</span><span class='hljl-p'>)</span><span class='hljl-t'>
  </span><span class='hljl-nf'>remake</span><span class='hljl-p'>(</span><span class='hljl-n'>prob</span><span class='hljl-p'>,</span><span class='hljl-n'>p</span><span class='hljl-oB'>=</span><span class='hljl-n'>rand</span><span class='hljl-oB'>.</span><span class='hljl-p'>(</span><span class='hljl-n'>θ</span><span class='hljl-p'>))</span><span class='hljl-t'>
</span><span class='hljl-k'>end</span><span class='hljl-t'>
</span><span class='hljl-n'>ensemble_prob</span><span class='hljl-t'> </span><span class='hljl-oB'>=</span><span class='hljl-t'> </span><span class='hljl-nf'>EnsembleProblem</span><span class='hljl-p'>(</span><span class='hljl-nf'>ODEProblem</span><span class='hljl-p'>(</span><span class='hljl-n'>f1</span><span class='hljl-p'>,</span><span class='hljl-n'>u0</span><span class='hljl-p'>,</span><span class='hljl-n'>tspan</span><span class='hljl-p'>,</span><span class='hljl-n'>θ</span><span class='hljl-p'>),</span><span class='hljl-t'>
                                </span><span class='hljl-n'>prob_func</span><span class='hljl-t'> </span><span class='hljl-oB'>=</span><span class='hljl-t'> </span><span class='hljl-n'>prob_func</span><span class='hljl-p'>)</span><span class='hljl-t'>
</span><span class='hljl-n'>sol</span><span class='hljl-t'> </span><span class='hljl-oB'>=</span><span class='hljl-t'> </span><span class='hljl-nf'>solve</span><span class='hljl-p'>(</span><span class='hljl-n'>ensemble_prob</span><span class='hljl-p'>,</span><span class='hljl-nf'>Tsit5</span><span class='hljl-p'>(),</span><span class='hljl-nf'>EnsembleThreads</span><span class='hljl-p'>(),</span><span class='hljl-n'>trajectories</span><span class='hljl-oB'>=</span><span class='hljl-ni'>1000</span><span class='hljl-p'>)</span><span class='hljl-t'>

</span><span class='hljl-k'>using</span><span class='hljl-t'> </span><span class='hljl-n'>DiffEqBase</span><span class='hljl-oB'>.</span><span class='hljl-n'>EnsembleAnalysis</span><span class='hljl-t'>
</span><span class='hljl-nf'>plot</span><span class='hljl-p'>(</span><span class='hljl-nf'>EnsembleSummary</span><span class='hljl-p'>(</span><span class='hljl-n'>sol</span><span class='hljl-p'>))</span>
</pre>


<img src=""  />

<p>From just a few variables having probabilities, every variable has an induced probability: there is a probability distribution on the integrator states, the output at time t_i, etc.</p>
<h2>Bayesian Estimation with Point Estimates: Bayes&#39; Rule, Maximum Likelihood, and MAP</h2>
<p>Recall from our previous studies that the difficult part of modeling is not necessarily the forward modeling approach, rather it&#39;s the incorporation of data or the estimation problem that is difficult. When your variables are now random distributions, how do you &quot;fit&quot; them?</p>
<p>The answer comes from Bayes&#39; rule, which is the following. Assume you had a prior distribution <span class="math">$p(\theta)$</span> for the probability that <span class="math">$X$</span> is a given value <span class="math">$\theta$</span>. Then the posterior probability distribution, <span class="math">$p(\theta|D)$</span>, or the distribution which is updated to include data, is given by:</p>
<p class="math">\[
p(\theta|D) = \frac{p(D|\theta)p(\theta)}{\int_\Omega p(D|\theta)p(\theta)d\theta}
\]</p>
<p>The scaling factor on the denominator is simply a constant to make the distribution integrate 1 &#40;so that the resulting function is a probability distribution&#33;&#41;. The numerator is simply the prior distribution multiplied by the likelihood of seeing the data given the value of the random variable. The prior distribution must be given but notice that the likelihood has another name: the likelihood is the model.</p>
<p>The reason why it&#39;s the same thing is because the model is what tells you the expected outcomes given a value of the random variable, and your data is on an expected outcome&#33; However, the likelihood encodes a little bit more information in that it again is a distribution and not a point estimate. We need to make a choice for our <em>measurement distribution</em> on our model&#39;s results.</p>
<h4>Quick Question: Why is this referred to as measurement noise? Why is it not process noise?</h4>
<p>A common choice for the measurement distribution is the Normal distribution. This comes from the Central Limit Theorem &#40;CLT&#41; which essentially states that, given enough interacting mechanisms, the average values of things &quot;tend to become normally distributed&quot;. The true statement of the CLT is much more complex, but that is a decent working definition for practical use. The normal distribution is defined by two parameters, <span class="math">$\mu$</span> and <span class="math">$\sigma$</span>, and is given by the following function:</p>
<p class="math">\[
f(x;\mu,\sigma) = \frac{1}{\sigma\sqrt{2\pi}}\exp(\frac{-(x-\mu)^2}{2\sigma^2})
\]</p>
<p>This is a bell curve centered at <span class="math">$\mu$</span> with a variance of <span class="math">$\sigma$</span>. Our best guess for the output, i.e. the model&#39;s prediction, should be the average measurement, meaning that <span class="math">$\mu$</span> is the result from the simulator. <span class="math">$\sigma$</span> is a parameter for how much measurement error we expect &#40;some intuition on <span class="math">$\sigma$</span> will come soon&#41;.</p>
<p>Let&#39;s return to thinking about the ODE example. In this case we have <span class="math">$\theta$</span> as a vector of random variables. This means that <span class="math">$u(t;\theta)$</span> is a random variable for the ODE <span class="math">$u'= ...$</span>&#39;s solution at a given point in time <span class="math">$t$</span>. If we have a measurement at a time <span class="math">$t_i$</span> and assume our measurement noise is normally distributed with some constant measurement noise <span class="math">$\sigma$</span>, then the likelihood of our data would be <span class="math">$f(x_i;u(t_i;\theta),\sigma)$</span> at each data point <span class="math">$(t_i,x_i)$</span>. From probability we know that seeing the composition of events is given by the multiplication of probabilities, so the probability of seeing the full dataset given observations <span class="math">$D = (t_i,x_i)$</span> along the timeseries is:</p>
<p class="math">\[
p(D|\theta) = \prod_i f(x_i;u(t_i;\theta),\sigma)
\]</p>
<p>This can be read as: solve the model with the given parameters, and the probability of having seen the measurement is thus given by a product of normal distribution calculations. Note that in many cases the product is not numerically stable &#40;and grows exponentially&#41;, and so the likelihood is transformed to the log-likelihood. To get this expression, we take the log of both sides and notice that the product becomes a summation, and thus:</p>
<p class="math">\[
\begin{align}
\log p(D|\theta) &= \sum_i \log f(x_i;u(t_i;\theta),\sigma)\\
                 &= \frac{N}{\sqrt{2\pi}\sigma} + \frac{1}{2\sigma^2} \sum_i -(x-\mu)^2
\end{align}
\]</p>
<p>Notice that <strong>maximizing this log-likelihood is equivalent to minimizing the L2 norm of the solution against the data&#33;</strong>. Thus we can see a few things:</p>
<ol>
<li><p>Previous parameter estimation by minimizing a norm against data can be seen as maximum likelihood with some measurement distribution. L2 norm corresponds to assuming measurement noise is normally distributed and all of the measurements have the same error variance.</p>
</li>
<li><p>By the same derivation, having different error variances with normally distributed errors is equivalent to doing weighted L2 estimation.</p>
</li>
</ol>
<p>This reformulation &#40;generalization?&#41; to likelihoods of probability distributions is known as <em>maximum likelihood estimation</em> &#40;MLE&#41;, but is equivalent to our previous forms of parameter estimation using point estimates against data. However, this calculation is ignoring Bayes&#39; rule, and is thus not finding the parameters which have the highest probability. To do that, we need to go back to Bayes&#39; rule which states that:</p>
<p class="math">\[
\log p(\theta|D) = \log p(D|\theta) + \log p(\theta) - C
\]</p>
<p>Thus, maximizing the log-likelihood is &quot;almost&quot; the same as finding the most probable parameters, except that we need to add weights given <span class="math">$\log p(\theta)$</span> from our prior distribution&#33; If we assume our prior distribution is flat, like a uniform distribution, then we have a <em>non-informative prior</em> and the maximum posterior point matches that of the maximum likelihood estimation. However, this formulation allows us to get point estimates in a way that takes into account prior knowledge, and is call <em>maximum a posteriori estimation</em> &#40;MAP&#41;.</p>
<h2>Bayesian Estimation of Posterior Distributions with Monte Carlo</h2>
<p>The previous discussion still solely focused on getting point estimates for the most probable parameters. However, what if we wanted to find the distributions of the parameters, i.e. the full <span class="math">$p(D|\theta)$</span>? Outside of very few small models, this cannot be done analytically and is thus the basic problem of probabilistic programming. There are two general approaches:</p>
<ol>
<li><p>Sampling-based approaches. Sample parameters <span class="math">$\theta_i$</span> in such a manner that the array <span class="math">$[\theta_i]$</span> converges to an array sampled from the true distribution, and thus with enough samples one can capture the distribution numerically.</p>
</li>
<li><p>Variational inference. Find some way to represent the probability distribution and push forward the distributions at every step of the program.</p>
</li>
</ol>
<h4>Recovering Distributions from Sampled Points</h4>
<p>It&#39;s clear from above that if you have a distribution, like <code>Normal&#40;5,1&#41;</code>, that you can sample from the distribution to get an array of values which follow the distribution. However, in order for the following sampling approaches to make sense, we need to see how to recover a distribution from discrete samples. So let&#39;s say you had a bunch of normally distributed points:</p>


<pre class='hljl'>
<span class='hljl-n'>X</span><span class='hljl-t'> </span><span class='hljl-oB'>=</span><span class='hljl-t'> </span><span class='hljl-nf'>Normal</span><span class='hljl-p'>(</span><span class='hljl-ni'>5</span><span class='hljl-p'>,</span><span class='hljl-ni'>1</span><span class='hljl-p'>)</span><span class='hljl-t'>
</span><span class='hljl-n'>x</span><span class='hljl-t'> </span><span class='hljl-oB'>=</span><span class='hljl-t'> </span><span class='hljl-p'>[</span><span class='hljl-nf'>rand</span><span class='hljl-p'>(</span><span class='hljl-n'>X</span><span class='hljl-p'>)</span><span class='hljl-t'> </span><span class='hljl-k'>for</span><span class='hljl-t'> </span><span class='hljl-n'>i</span><span class='hljl-t'> </span><span class='hljl-kp'>in</span><span class='hljl-t'> </span><span class='hljl-ni'>1</span><span class='hljl-oB'>:</span><span class='hljl-ni'>100</span><span class='hljl-p'>]</span><span class='hljl-t'>
</span><span class='hljl-nf'>scatter</span><span class='hljl-p'>(</span><span class='hljl-n'>x</span><span class='hljl-p'>,[</span><span class='hljl-ni'>1</span><span class='hljl-t'> </span><span class='hljl-k'>for</span><span class='hljl-t'> </span><span class='hljl-n'>i</span><span class='hljl-t'> </span><span class='hljl-kp'>in</span><span class='hljl-t'> </span><span class='hljl-ni'>1</span><span class='hljl-oB'>:</span><span class='hljl-ni'>100</span><span class='hljl-p'>])</span>
</pre>


<img src=""  />

<p>Notice that there are more points in the areas of higher probability. Thus the density of sampled points gives us an estimate for the probability of having points in a given area. We can then count the number of points in a bin and divide by the total number of points in order to get the probability of being in a specific region. This is depicted by a histogram:</p>


<pre class='hljl'>
<span class='hljl-nf'>histogram</span><span class='hljl-p'>(</span><span class='hljl-n'>x</span><span class='hljl-p'>)</span>
</pre>


<img src=""  />

<p>and we see this converges when we get more points:</p>


<pre class='hljl'>
<span class='hljl-nf'>histogram</span><span class='hljl-p'>([</span><span class='hljl-nf'>rand</span><span class='hljl-p'>(</span><span class='hljl-n'>X</span><span class='hljl-p'>)</span><span class='hljl-t'> </span><span class='hljl-k'>for</span><span class='hljl-t'> </span><span class='hljl-n'>i</span><span class='hljl-t'> </span><span class='hljl-kp'>in</span><span class='hljl-t'> </span><span class='hljl-ni'>1</span><span class='hljl-oB'>:</span><span class='hljl-ni'>10000</span><span class='hljl-p'>],</span><span class='hljl-n'>normed</span><span class='hljl-oB'>=</span><span class='hljl-kc'>true</span><span class='hljl-p'>)</span><span class='hljl-t'>
</span><span class='hljl-k'>using</span><span class='hljl-t'> </span><span class='hljl-n'>StatsPlots</span><span class='hljl-t'>
</span><span class='hljl-nf'>plot!</span><span class='hljl-p'>(</span><span class='hljl-n'>X</span><span class='hljl-p'>,</span><span class='hljl-n'>lw</span><span class='hljl-oB'>=</span><span class='hljl-ni'>5</span><span class='hljl-p'>)</span>
</pre>


<img src=""  />

<p>A continuous form of this is the <em>kernel density estimate</em>, which is essentially a smoothed binning approach.</p>


<pre class='hljl'>
<span class='hljl-k'>using</span><span class='hljl-t'> </span><span class='hljl-n'>KernelDensity</span><span class='hljl-t'>
</span><span class='hljl-nf'>plot</span><span class='hljl-p'>(</span><span class='hljl-nf'>kde</span><span class='hljl-p'>([</span><span class='hljl-nf'>rand</span><span class='hljl-p'>(</span><span class='hljl-n'>X</span><span class='hljl-p'>)</span><span class='hljl-t'> </span><span class='hljl-k'>for</span><span class='hljl-t'> </span><span class='hljl-n'>i</span><span class='hljl-t'> </span><span class='hljl-kp'>in</span><span class='hljl-t'> </span><span class='hljl-ni'>1</span><span class='hljl-oB'>:</span><span class='hljl-ni'>10000</span><span class='hljl-p'>]),</span><span class='hljl-n'>lw</span><span class='hljl-oB'>=</span><span class='hljl-ni'>5</span><span class='hljl-p'>)</span><span class='hljl-t'>
</span><span class='hljl-nf'>plot!</span><span class='hljl-p'>(</span><span class='hljl-n'>X</span><span class='hljl-p'>,</span><span class='hljl-n'>lw</span><span class='hljl-oB'>=</span><span class='hljl-ni'>5</span><span class='hljl-p'>)</span>
</pre>


<img src=""  />

<p>Thus, for the sampling-based approaches, we simply need to arrive at an array which is sampled according to the distribution that we want to estimate, and from that array we can recover the distribution.</p>
<h3>Sampling Distributions with the Metropolis Hastings Algorithm</h3>
<p>The Metropolis-Hastings algorithm is the simplest form of <em>Markov Chain Monte Carlo</em> &#40;MCMC&#41; which gives a way of sampling the <span class="math">$\theta$</span> distribution. To see how this algorithm works, let&#39;s understand the ratio between two points in the posterior probability. If we have <span class="math">$x_i$</span> and <span class="math">$x_j$</span>, the ratio of the two probabilities would be given by:</p>
<p class="math">\[
\frac{p(x_i|D)}{p(x_j|D)} = \frac{p(D|x_i)p(x_i)}{p(D|x_j)p(x_j)}
\]</p>
<p>&#40;notice that the integration constant cancels&#41;. This motivates the idea that all we have to do is ensure we only go to a point <span class="math">$x_j$</span> from <span class="math">$x_i$</span> with probability difference that matches that ratio, and over time if we do this between &quot;all points&quot; we will have the right number of &quot;each point&quot; in the distribution &#40;quotes because it&#39;s continuous&#41;. With a bit more rigour we arrive at the following algorithm:</p>
<ol>
<li><p>Starting at <span class="math">$x_i$</span>, take <span class="math">$x_{i+1}$</span> from a sampling algorithm <span class="math">$g(x_{i+1}|x_i)$</span>.</p>
</li>
<li><p>Calculate <span class="math">$A = \min\left(1,\frac{p(D|x_{i+1})p(x_{i+1})g(x_i|x_{i+1})}{p(D|x_i)p(x_i)g(x_{i+1}|x_i)}\right)$</span>. Notice that if we require <span class="math">$g$</span> to be symmetric, then this simplies to the probability ratio <span class="math">$A = \min\left(1,\frac{p(D|x_{i+1})p(x_{i+1})}{p(D|x_i)p(x_i)}\right)$</span></p>
</li>
<li><p>Use a random number to accept the step with a probability <span class="math">$A$</span>. Go back to step 1, incrementing <span class="math">$i$</span> if accepted, otherwise just repeat.</p>
</li>
</ol>
<p>I.e, we just walk around the space biasing the acceptance of a step by the factor <span class="math">$\frac{p(x_i|D)}{p(x_j|D)}$</span> and sooner or later we will have spent the right amount of time in each area, giving the correct distribution.</p>
<p>&#40;This can be rigorously proven, and those details are left out.&#41;</p>
<h3>The Cost of Bayesian Estimation</h3>
<p>Let&#39;s take a quick moment to understand the high cost of Bayesian posterior estimations. While before we were getting point estimates, now we are trying to recover a full probability distribution, and each accept/reject probability calculation requires evaluating the likelihood at some point. Remember, the likelihood is generated by our simulator, and thus every evaluation here is an ODE solver call or a neural network forward pass&#33; This means that to get good distributions, we are solving the ODE hundreds of thousands of times, i.e. even more than when doing parameter estimation&#33; This is something to keep in mind.</p>
<p>However, notice that this process is trivially parallelizable. We can just have <em>parallel chains</em> on going, i.e. start 16 processes all doing Metropolis-Hastings, and in the end they are all sampling from the same distribution, so the final array can simply be the pooled results of each chain.</p>
<h3>Hamiltonian Monte Carlo</h3>
<p>Metropolis-Hastings is easy to motivate and implement. However, it does not do well in high dimensional spaces because it searches in all directions. For example, it&#39;s common for the sampling distribution <span class="math">$g$</span> to be a multivariable distribution &#40;i.e. normal in all directions&#41;. However, high dimensional objects commonly sit on low dimensional manifolds &#40;known as the <em>manifold hypothesis</em>&#41;. If that&#39;s the case, the most probable set of parameters is something that is low dimensional. For example, parameters may compensate for one another, and so <span class="math">$\theta_1^2 + \theta_2^2 + \theta_3^2 = 1$</span> might be the manifold on which all of the most probable choices for <span class="math">$\theta$</span> lie, in which case we need sample on the sphere instead of all of <span class="math">$\mathbb{R}^3$</span>.</p>
<p>However, it&#39;s quick to see that this will give Metropolis-Hastings some trouble, since it will use a normal distribution around the current point, and thus even if we start on the sphere, it will have a high chance of trying a point not on the sphere in the next round&#33; This can be depicted as:</p>
<p><img src="https://user-images.githubusercontent.com/1814174/69541382-fe85b100-0f56-11ea-8852-ba2044084d43.PNG" alt="" /></p>
<p>Recall that every single rejection is still evaluating the likelihood &#40;since it&#39;s calculating an acceptance probability, finding it near zero, rejecting and starting again&#41;, and every likelihood call is calling our simulator, and so this is sllllllooooooooooooooowwwwwwwww in high dimensions&#33;</p>
<p>What we need to do instead is ensure that we walk along the path of high probability. What we want to do is thus build a vector field that matches our high probability regions</p>
<p><img src="https://user-images.githubusercontent.com/1814174/69541530-5ae8d080-0f57-11ea-9673-36affc62d315.PNG" alt="" /></p>
<p>and follow said vector field &#40;following a vector field is solving what kind of equation?&#41;. The first idea one might have is to use the gradient. However, while this idea has the right intentions, the issue is that the gradient of the probability will average out all of the possible probabilities, and will thus flow towards the mode of the distribution:</p>
<p><img src="https://user-images.githubusercontent.com/1814174/69541683-bd41d100-0f57-11ea-9356-dc2f771cca7d.PNG" alt="" /></p>
<p>To overcome this issue, we look to physical systems and see that a satelite orbiting a planet always nicely stays on some manifold instead of following the gradient:</p>
<p><img src="https://user-images.githubusercontent.com/1814174/69541780-fed27c00-0f57-11ea-913f-6b7135ad5fe4.PNG" alt="" /></p>
<p>The reason why it does is because it has momentum. Recall from basic physics that one way to describe a physical system is through <em>Hamiltonian mechanics</em>, where <span class="math">$H(x,p)$</span> is the energy associated with the state <span class="math">$(x,p)$</span> &#40;normally <span class="math">$x$</span> is location and <span class="math">$p$</span> is momentum&#41;. Due to conservation of energy, the solution of the dynamical equations leads to <span class="math">$H(x,p)$</span> being constant, and thus the dynamics follow the <em>level sets</em> of <span class="math">$H$</span>. From the Hamiltonian the dynamics of the system are:</p>
<p class="math">\[
\begin{align}
\frac{dx}{dt} &=  \frac{dH}{dp}\\
              &= -\frac{dH}{dx}
\end{align}
\]</p>
<p>Here we want our Hamiltonian to be our posterior probability, so that way we stay on the manifold of high probability. This means:</p>
<p class="math">\[
H(x,p) = - \log \pi(x,p)
\]</p>
<p>where <span class="math">$\pi(x,p) = \pi(p|x)\pi(x)$</span> &#40;where I am now using <span class="math">$pi$</span> for probability since <span class="math">$p$</span> is momentum&#33;&#41;. So to lift from a probability over parameters to one thtat includes momentum, we simply need to choose a conditional distribution <span class="math">$\pi(p|x)$</span>. This would mean that</p>
<p class="math">\[
\begin{align}
H(x,p) &= -log \pi(p|x) - \log \pi(x)\\
       &= K(p,x) + V(x)
\end{align}
\]</p>
<p>where <span class="math">$K$</span> is the kinetic energy and <span class="math">$V$</span> is the potential. Thus the potential energy is directly given by the posterior calcuation, and the kinetic energy is thus a choice that is used to build the correct Hamiltonian. Hamiltonian Monte Carlo methods then dig into good ways to choose the kinetic energy function. This is done at the start &#40;along with the choice of ODE solver time step&#41; in such a way that it maximizes acceptance probabilities.</p>
<h3>Connections to Differentiable Programming</h3>
<p class="math">\[
-\frac{dH}{dx}
\]</p>
<p>requires calculating the gradient of the likelihood function with respect to the parameters, so we are once again using the gradient of our simulator&#33; This means that all of our previous discussion on automatic differentiation and differentiable programming applies to the Hamiltonian Monte Carlo context.</p>
<p>There&#39;s another thread to follow that transformations of probability distributions are pushforwards of the Jacobian transformations &#40;given the transformation of an integral formula&#41;, and this is used when doing variational inference.</p>
<h3>Symplectic and Geometric Integration</h3>
<p>One way to integrate the system of ODEs which result from the Hamiltonian system is to convert it to a system of first order ODEs and solve it directly. However, this loses information and can result in drift. This is demonstrated by looking at the long time solution of the pendulum:</p>


<pre class='hljl'>
<span class='hljl-k'>using</span><span class='hljl-t'> </span><span class='hljl-n'>ParameterizedFunctions</span><span class='hljl-t'>
</span><span class='hljl-n'>u0</span><span class='hljl-t'> </span><span class='hljl-oB'>=</span><span class='hljl-t'> </span><span class='hljl-p'>[</span><span class='hljl-nfB'>1.</span><span class='hljl-p'>,</span><span class='hljl-nfB'>0.</span><span class='hljl-p'>]</span><span class='hljl-t'>
</span><span class='hljl-n'>harmonic!</span><span class='hljl-t'> </span><span class='hljl-oB'>=</span><span class='hljl-t'> </span><span class='hljl-nd'>@ode_def</span><span class='hljl-t'> </span><span class='hljl-n'>HarmonicOscillator</span><span class='hljl-t'> </span><span class='hljl-k'>begin</span><span class='hljl-t'>
   </span><span class='hljl-n'>dv</span><span class='hljl-t'> </span><span class='hljl-oB'>=</span><span class='hljl-t'> </span><span class='hljl-oB'>-</span><span class='hljl-n'>x</span><span class='hljl-t'>
   </span><span class='hljl-n'>dx</span><span class='hljl-t'> </span><span class='hljl-oB'>=</span><span class='hljl-t'> </span><span class='hljl-n'>v</span><span class='hljl-t'>
</span><span class='hljl-k'>end</span><span class='hljl-t'>
</span><span class='hljl-n'>tspan</span><span class='hljl-t'> </span><span class='hljl-oB'>=</span><span class='hljl-t'> </span><span class='hljl-p'>(</span><span class='hljl-nfB'>0.0</span><span class='hljl-p'>,</span><span class='hljl-nfB'>10.0</span><span class='hljl-p'>)</span><span class='hljl-t'>
</span><span class='hljl-n'>tspan</span><span class='hljl-t'> </span><span class='hljl-oB'>=</span><span class='hljl-t'> </span><span class='hljl-p'>(</span><span class='hljl-nfB'>0.0</span><span class='hljl-p'>,</span><span class='hljl-nfB'>10000.0</span><span class='hljl-p'>)</span><span class='hljl-t'>
</span><span class='hljl-n'>prob</span><span class='hljl-t'> </span><span class='hljl-oB'>=</span><span class='hljl-t'> </span><span class='hljl-nf'>ODEProblem</span><span class='hljl-p'>(</span><span class='hljl-n'>harmonic!</span><span class='hljl-p'>,</span><span class='hljl-n'>u0</span><span class='hljl-p'>,</span><span class='hljl-n'>tspan</span><span class='hljl-p'>)</span><span class='hljl-t'>
</span><span class='hljl-n'>sol</span><span class='hljl-t'> </span><span class='hljl-oB'>=</span><span class='hljl-t'> </span><span class='hljl-nf'>solve</span><span class='hljl-p'>(</span><span class='hljl-n'>prob</span><span class='hljl-p'>,</span><span class='hljl-nf'>Tsit5</span><span class='hljl-p'>())</span><span class='hljl-t'>
</span><span class='hljl-nf'>gr</span><span class='hljl-p'>(</span><span class='hljl-n'>fmt</span><span class='hljl-oB'>=:</span><span class='hljl-n'>png</span><span class='hljl-p'>)</span><span class='hljl-t'> </span><span class='hljl-cs'># Make it a PNG instead of an SVG since there&#39;s a lot of points!</span><span class='hljl-t'>
</span><span class='hljl-nf'>plot</span><span class='hljl-p'>(</span><span class='hljl-n'>sol</span><span class='hljl-p'>,</span><span class='hljl-n'>vars</span><span class='hljl-oB'>=</span><span class='hljl-p'>(</span><span class='hljl-ni'>1</span><span class='hljl-p'>,</span><span class='hljl-ni'>2</span><span class='hljl-p'>))</span>
</pre>


<img src=""  />


<pre class='hljl'>
<span class='hljl-nf'>plot</span><span class='hljl-p'>(</span><span class='hljl-n'>sol</span><span class='hljl-p'>)</span>
</pre>


<img src=""  />

<p>What is an oscilatory system slowly loses energy and falls inward towards the center. To avoid this issue, we can do a few things:</p>
<ol>
<li><p>Project back to the manifold after steps. That can be costly &#40;but almost might only need to happen every once in awhile&#33;&#41;</p>
</li>
<li><p>Use a symplectic integrator.</p>
</li>
</ol>
<p>A <em>symplectic integrator</em> is an integrator who&#39;s solution lives on a symplectic manifold, i.e. it preserves area in in the <span class="math">$(x,p)$</span> ellipses as it numerically approximates the flow. This means that:</p>
<ul>
<li><p>Long-time integrations are truly cyclic with only floating point drift.</p>
</li>
<li><p>Steps preserve area. In the sense of Hamiltonian Monte Carlo, this means preserve probability and thus increase the acceptance rate.</p>
</li>
</ul>
<p>These properties are demonstrated in <a href="https://tutorials.juliadiffeq.org/html/models/05-kepler_problem.html">the Kepler problem demo</a>. However, note that while the solution lives on a symplectic manifold, it isn&#39;t necessarily the correct symplectic manifold. The shift in the manifold is <span class="math">$\mathcal{O}(\Delta t^k)$</span> where <span class="math">$k$</span> is the order of the method. For more information on symplectic integration, consult <a href="https://scicomp.stackexchange.com/a/29154/18981">this StackOverflow response which goes into depth</a>.</p>
<h3>Application: Bayesian Estimation of Differential Equation Parameters</h3>
<p>For a full demo of probabilistic programming on a differential equation system, see <a href="https://tutorials.juliadiffeq.org/html/models/06-pendulum_bayesian_inference.html">this tutorial on Bayesian inference of pendulum parameteres</a> utilizing DifferentialEquations.jl and DiffEqBayes.jl.</p>
<h2>Bayesian Estimation of Posterior Distributions with Variational Inference</h2>
<p>Instead of using sampling, one can use variational inference to push through probability distributions. There are many ways to do variational inference, but a lot of the methods can be very model-specific. However, a recent change to probabilistic programming has been the development of <em>Automatic Differentiation Variational Inference &#40;ADVI&#41;</em>: a general variational inference method which is not model-specific and instead uses AD. This has allowed for large expensive models to get effective distributional estimation, something that wasn&#39;t previously possible with HMC. In this section we will build up this methodology and understand its performance characteristics.</p>
<h3>ADVI as Optimization</h3>
<p>In this form of variational inference, we wish to directly estimate the posterior distribution. To do so, we pick a functional form to represent the solution <span class="math">$q(\theta; \phi)$</span> where <span class="math">$\phi$</span> are latent variables. We want our resulting distribution to fit the posterior, and tus we enforce that:</p>
<p class="math">\[
\phi^\ast = \text{argmin}_{\phi} \text{KL} \left(q(\theta; \phi) \Vert p(\theta | D)\right)
\]</p>
<p>where KL is the KL-divergence. KL-divergence is a distance function over probability distributions, and so this is simply a cost function over the distance between a chosen distribution and a desired distribution, where when <span class="math">$\phi$</span> are good we will have <span class="math">$q$</span> as a good approximation to the posterior.</p>
<p>However, the KL divergence lacks an analytical form because it requires knowing the posterior, the quantity we are trying to numerically estimate. However, it turns out that we can instead maximize the <em>Evidence Lower Bound &#40;ELBO&#41;</em>:</p>
<p class="math">\[
\mathcal{L}(\phi) = \mathbb{E}_{q}[\log p(x,\theta)] - \mathbb{E}_q [\log q(\theta; \phi)]
\]</p>
<p>The ELBO is equivalent to the negative KL divergence up to a constant <span class="math">$\log p(x)$</span>, which means that maximizing this is equivalent to minimizing the KL divergence.</p>
<p>One last detail is neccessary in order for this problem to be tractible. To know the set of possible values to optimize over, we assume that the support of <span class="math">$q$</span> is a subset of the support of the prior. This means that our prior has to cover the probability distribution, which makes sense and matches Cromwell&#39;s rule for MCMC.</p>
<p>At this point we now assume that <span class="math">$q$</span> is Gaussian. When we rewrite the ELBO in terms of the standard Gaussian, we receive an expectation that is automatically differentiable. Calculating gradients is thus done with AD. Using only one or a few solves gives a noisy gradient to sample and optimize the latent variables to hone in on latent variables.</p>
<h2>A Note on Implementation of Optimization for Probabilistic Programming</h2>
<p>Variable domains can be constrained. For example, you may require a positive value. This can be handled by a transformation. For example, if <span class="math">$y$</span> must be positive, then one can optimize implicitly using <span class="math">$\exp(y)$</span> at every point, this allowing <span class="math">$y$</span> to be any real value with then <span class="math">$\exp(y)$</span> is positive. This turns the problem into an unconstrained optimization over the real numbers, and similar transformations can be done with any of the standard probability distribution&#39;s support function.</p>
<h4>Citation</h4>
<p>For Hamiltonian Monte Carlo, the images were taken from <a href="https://arxiv.org/pdf/1701.02434.pdf">A Conceptual Introduction to Hamiltonian Monte Carlo</a> by Michael Betancourt.</p>


        <HR/>
        <div class="footer">
          <p>
            Published from <a href="probabilistic_programming.jmd">probabilistic_programming.jmd</a>
            using <a href="http://github.com/JunoLab/Weave.jl">Weave.jl</a> v0.10.6 on 2020-12-12.
          </p>
        </div>
      </div>
    </div>
  </div>
</BODY>

</HTML>
